ingest/ledgerbackend: Decouple meta pipe buffering in CaptiveStellarCore backend #3187

Merged 5 commits on Nov 9, 2020

Conversation

@bartekn (Contributor) commented Nov 3, 2020

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

This PR introduces bufferedLedgerMetaReader, which decouples buffering and unmarshaling from stellarCoreRunner and CaptiveStellarCore.

Why

bufferedLedgerMetaReader fixes multiple issues:

  • It fixes ingest/ledgerbackend: Debug slow ingestion during catchup in captive-core #3132 by increasing the internal buffers' sizes so they can hold more ledgers, which makes the catchup code much faster.
  • It fixes Improve the detection of captive-core failures #3158. bufferedLedgerMetaReader allowed rewriting the shutdown code into a much simpler version: bufferedLedgerMetaReader and CaptiveStellarCore now listen to a single shutdown signal, stellarCoreRunner.getProcessExitChan(). When the Stellar-Core process terminates, the bufferedLedgerMetaReader.Start goroutine stops and CaptiveStellarCore returns a user-friendly error from the PrepareRange and GetLedger methods. When CaptiveStellarCore.Close() is called, it kills the Stellar-Core process, triggering the shutdown path described above.
  • It moves buffering and unmarshaling into a single struct, which makes stellarCoreRunner and CaptiveStellarCore simpler.
  • It fixes a possible OOM issue when the network closes a series of large ledgers. In that case, bufferedLedgerMetaReader waits for the buffer to be consumed before reading more ledgers into memory, preventing unbounded memory growth.
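The bounded-buffer-plus-single-shutdown-signal pattern described in these bullets can be sketched as follows. This is a simplified illustration, not the Horizon code: `readAhead`, the `metaResult` fields, and the channel names here are assumptions.

```go
package main

import "fmt"

// metaResult stands in for a parsed ledger or an error (illustrative).
type metaResult struct {
	seq int
	err error
}

// readAhead pushes parsed ledgers into a bounded channel. The channel's
// capacity provides backpressure: the send blocks when the buffer is full,
// so at most cap(out) ledgers are held in memory. A single exit channel
// stops the goroutine when the core process terminates.
func readAhead(ledgers []int, out chan metaResult, exit <-chan struct{}) {
	defer close(out)
	for _, seq := range ledgers {
		select {
		case out <- metaResult{seq: seq}: // blocks when the buffer is full
		case <-exit: // single shutdown signal (process exit)
			return
		}
	}
}

func main() {
	out := make(chan metaResult, 2) // small buffer to show backpressure
	exit := make(chan struct{})
	go readAhead([]int{1, 2, 3, 4}, out, exit)
	for r := range out {
		fmt.Println(r.seq) // prints 1, 2, 3, 4, one per line
	}
}
```

The consumer simply ranges over the channel; when either the producer finishes or the exit signal fires, the channel is closed and the loop ends, which is the simplification the PR description is after.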

Known limitations

  • No tests for bufferedLedgerMetaReader. I'll add them after initial approval.

@cla-bot cla-bot bot added the cla: yes label Nov 3, 2020
@bartekn bartekn requested a review from a team November 4, 2020 16:37
@bartekn bartekn marked this pull request as ready for review November 4, 2020 16:37
// close it now because there may be some ledgers in a buffer.
select {
case b.c <- metaResult{nil, err}:
case <-b.runner.getProcessExitChan():
Contributor:
I wonder if we can simplify the shutdown logic further by using contexts and passing them down to each component down to the stellarCoreRunner instance. Within stellarCoreRunner we can use https://golang.org/pkg/os/exec/#CommandContext to execute stellar core. Perhaps we can make an issue separate from this PR to investigate if using contexts would help.

Contributor Author:

I looked into using CommandContext and I think it won't help much. We still need a close() method to do some cleanup, and we can't do that with a context alone. But we can definitely investigate how to write shutdown code for more complicated object graphs. For example, to figure out how to close CaptiveCore, bufferedLedgerMetaReader and stellarCoreRunner, I drew a graph to understand the dependencies.

Contributor Author:

Issue: #3200.

// usually wait for a specific time duration before checking if the ledger is
// available. When catching up with a small buffer this can increase the overall
// time because ledgers are not available.
ledgerReadAheadBufferSize = 100
@tamirms (Contributor) commented Nov 5, 2020:

In the worst-case scenario we could have 1000 MB of ledgers in the queue. That scenario is unlikely to happen. However, if you're interested, it should be possible to ensure the queued ledgers don't exceed a fixed memory threshold by using a circular buffer:

diff --git a/ingest/ledgerbackend/buffered_meta_pipe_reader.go b/ingest/ledgerbackend/buffered_meta_pipe_reader.go
index 14707c76..0619e16e 100644
--- a/ingest/ledgerbackend/buffered_meta_pipe_reader.go
+++ b/ingest/ledgerbackend/buffered_meta_pipe_reader.go
@@ -2,6 +2,7 @@ package ledgerbackend
 
 import (
 	"bufio"
+	"fmt"
 	"io"
 	"time"
 
@@ -34,8 +35,59 @@ const (
 	// available. When catching up with a small buffer this can increase the overall
 	// time because ledgers are not available.
 	ledgerReadAheadBufferSize = 100
+	maxBytesBuffered          = 100 * 1024 * 1024 // 100mb
 )
 
+type stream struct {
+	queue  []uint32
+	length int
+	sum    uint32
+	idx    int
+}
+
+func newStream(capacity int) *stream {
+	return &stream{
+		queue: make([]uint32, capacity),
+	}
+}
+
+func (s *stream) seek(num int) error {
+	if num > s.length {
+		return fmt.Errorf("seek %v exceeds queue length %v", num, s.length)
+	}
+
+	for ; num > 0; num-- {
+		s.sum -= s.queue[s.idx]
+		s.length--
+		s.idx = (s.idx + 1) % len(s.queue)
+	}
+	return nil
+}
+
+func (s *stream) add(val uint32) {
+	s.sum += val
+	tail := (s.idx + s.length) % len(s.queue)
+	if s.length == len(s.queue) {
+		s.sum -= s.queue[s.idx]
+		s.idx = (s.idx + 1) % len(s.queue)
+	} else {
+		s.length++
+	}
+	s.queue[tail] = val
+}
+
+func (s *stream) tailSum(n int) (uint32, error) {
+	if n > s.length {
+		return 0, fmt.Errorf("tail %v exceeds queue length %v", n, s.length)
+	}
+	if n < s.length {
+		if err := s.seek(s.length - n); err != nil {
+			return 0, err
+		}
+	}
+	return s.sum, nil
+}
+
 type metaResult struct {
 	*xdr.LedgerCloseMeta
 	err error
@@ -64,6 +116,7 @@ type metaResult struct {
 type bufferedLedgerMetaReader struct {
 	r      *bufio.Reader
 	c      chan metaResult
+	queue  *stream
 	runner stellarCoreRunnerInterface
 }
 
@@ -72,6 +125,7 @@ type bufferedLedgerMetaReader struct {
 func newBufferedLedgerMetaReader(runner stellarCoreRunnerInterface) bufferedLedgerMetaReader {
 	return bufferedLedgerMetaReader{
 		c:      make(chan metaResult, ledgerReadAheadBufferSize),
+		queue:  newStream(ledgerReadAheadBufferSize),
 		r:      bufio.NewReaderSize(runner.getMetaPipe(), metaPipeBufferSize),
 		runner: runner,
 	}
@@ -93,7 +147,14 @@ func (b *bufferedLedgerMetaReader) readLedgerMetaFromPipe() (*xdr.LedgerCloseMet
 		}
 	}
 
-	for frameLength > metaPipeBufferSize && len(b.c) > 0 {
+	for {
+		totalBytes, err := b.queue.tailSum(len(b.c))
+		if err != nil {
+			return nil, errors.Wrap(err, "could not obtain buffer size")
+		}
+		if totalBytes == 0 || totalBytes+frameLength <= maxBytesBuffered {
+			break
+		}
 		// Wait for LedgerCloseMeta buffer to be cleared to minimize memory usage.
 		select {
 		case <-b.runner.getProcessExitChan():
@@ -117,6 +178,8 @@ func (b *bufferedLedgerMetaReader) readLedgerMetaFromPipe() (*xdr.LedgerCloseMet
 			return nil, err
 		}
 	}
+
+	b.queue.add(frameLength)
 	return &xlcm, nil
 }

Contributor Author:

It makes sense, but because this is unlikely I'd vote for simply making ledgerReadAheadBufferSize smaller, say 20 (so the maximum RAM usage is 200 MB). We can open an issue to revisit it in the future.

Contributor Author:

Changed in ee3226c.

@bartekn bartekn merged commit a10c000 into stellar:master Nov 9, 2020
@bartekn bartekn deleted the bufferedLedgerMetaReader branch November 9, 2020 18:59